FakeNewsNet: A Data Repository with News Content, Social Context and Spatiotemporal Information for Studying Fake News on Social Media

Kai Shu1, Deepak Mahudeswaran1, Suhang Wang2, Dongwon Lee2 and Huan Liu1

1Arizona State University, Tempe, 85281, USA

2Pennsylvania State University, University Park, PA, 16802, USA

{kai.shu, dmahudes, huan.liu}@asu.edu, {szw494, dongwon}@psu.edu

Abstract

Social media has become a popular means for people to consume and share news. At the same time, however, it has also enabled the wide dissemination of fake news, i.e., news with intentionally false information, causing significant confusion and disruption in society. To mitigate this problem, research on (computational) fake news detection has recently received a lot of attention. Despite several existing computational solutions for detecting fake news, the lack of comprehensive, community-driven fake news benchmark datasets has become one of the major roadblocks. Not only are existing datasets scarce, they also lack many of the features often required in such studies, such as news content, social context, and spatiotemporal information. Therefore, to facilitate fake news related research, we present a fake news benchmark data repository named FakeNewsNet, which contains two comprehensive datasets with diverse features in news content, social context, and spatiotemporal information. We present a detailed description of FakeNewsNet, demonstrate an exploratory analysis of the two datasets from varying perspectives, and discuss the benefits of FakeNewsNet for potential applications in fake news research on social media. The latest version of FakeNewsNet is available at:

https://github.com/KaiDMML/FakeNewsNet

1 Introduction

Social media has become a primary source of news consumption. It is cost-free, easy to access, and can disseminate posts quickly, which makes it an excellent way for individuals to post and consume information. For example, the time individuals spend on social media is continually increasing¹. As another example, a study from the Pew Research Center shows that around 68% of Americans got some of their news on social media in 2018², a share that has increased steadily since 2016. Since there is no regulatory authority on social media, the quality of news pieces spread on social media is often lower than that of traditional news sources. In other words, social media also enables the widespread dissemination of fake news. Fake news [18, 31] refers to false information that is spread deliberately to deceive people. Fake news affects individuals as well as society as a whole. First, fake news can disturb the authenticity balance of the news ecosystem. Second, fake news persuades consumers to accept false or biased stories. For example, some individuals and organizations spread fake news on social media for financial and political gain [1, 2]. It has also been reported that fake news influenced the 2016 US presidential election³. Finally, fake news may have significant effects on real-world events. For example, “Pizzagate”, a piece of fake news from Reddit, led to a real shooting⁴. Thus, fake news detection is a critical issue that needs to be addressed.

1https://www.socialmediatoday.com/marketing/how-much-time-do-people-spend-social-media-infographic

2http://www.journalism.org/2018/09/10/news-use-across-social-media-platforms-2018/

3https://www.independent.co.uk/life-style/gadgets-and-tech/news/tumblr-russian-hacking-us-presidential-election-fake-news-internet-research-agency-propaganda-bots-a8274321.html

4https://www.rollingstone.com/politics/politics-news/anatomy-of-a-fake-news-scandal-125877/


Detecting fake news on social media presents unique challenges. First, fake news pieces are intentionally written to mislead consumers, which makes it difficult to spot fake news from the news content alone.

Thus, we need to explore information in addition to news content, such as user engagements and social

behaviors of users on social media. For example, a credible user’s comment that “This is fake news” is

a strong signal that the news may be fake. Second, the research community lacks datasets which contain

spatiotemporal information to understand how fake news propagates over time in different regions, how users

react to fake news, and how we can extract useful temporal patterns for (early) fake news detection and

intervention. Thus, it is necessary to have comprehensive datasets that have news content, social context

and spatiotemporal information to facilitate fake news research. However, to the best of our knowledge,

existing datasets only cover one or two aspects.

Therefore, in this paper, we construct and publicize a multi-dimensional data repository, FakeNewsNet⁵, which currently contains two datasets with news content, social context, and spatiotemporal information. The datasets are constructed using an end-to-end system, FakeNewsTracker⁶ [27]. The FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. First, the rich set of features in the datasets provides an opportunity to experiment with different approaches for fake news detection, to understand the diffusion of fake news in social networks, and to intervene in it. Second, the temporal information enables the study of early fake news detection by generating synthetic user engagements from historical temporal user engagement patterns in the dataset [15]. Third, we can investigate the fake news diffusion process by identifying provenances and persuaders, and by developing better fake news intervention strategies [21]. Our data repository can serve as a starting point for many exploratory studies of fake news, and provide a better, shared insight into disinformation tactics. We aim to continuously update this data repository, expand it with new sources and features, and maintain its completeness.

The main contributions of the paper are:

We construct and publicize a multi-dimensional data repository to facilitate fake news research such as fake news detection, evolution, and mitigation;

We conduct an exploratory analysis of the datasets from different perspectives to demonstrate the quality of the datasets, understand their characteristics, and provide baselines for future fake news detection; and

We discuss the benefits of FakeNewsNet and provide insights for potential fake news studies on social media.

2 Background and Related Work

Fake news detection on social media aims to extract useful features and build effective models from existing social media datasets in order to detect fake news in the future. Thus, a comprehensive and large-scale dataset with multi-dimensional information about the online fake news ecosystem is important. The multi-dimensional information not only provides more signals for detecting fake news but can also be used for research on understanding fake news propagation and fake news intervention. Though several datasets for fake news detection exist, the majority of them contain only linguistic features, and few contain both linguistic and social context features. To facilitate research on fake news, we provide a data repository that includes not only news content and social context, but also spatiotemporal information. To make the differences clear, we list existing popular fake news detection datasets below and compare them with the FakeNewsNet repository in Table 1.

BuzzFeedNews⁷: This dataset comprises a complete sample of news published on Facebook by 9 news agencies over a week close to the 2016 U.S. election (September 19 to 23 and September 26 and 27). Every post and the linked article were fact-checked claim-by-claim by 5 BuzzFeed journalists. It contains 1,627 articles: 826 mainstream, 256 left-wing, and 545 right-wing articles.

5https://github.com/KaiDMML/FakeNewsNet

6http://blogtrackers.fulton.asu.edu:3000/#/about

7https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data


LIAR⁸: This dataset [26] is collected from the fact-checking website PolitiFact. It has 12.8K human-labeled short statements, which are labeled with six categories ranging from completely false to completely true: pants on fire, false, barely-true, half-true, mostly-true, and true.

BS Detector⁹: This dataset is collected from a browser extension called BS Detector, developed for checking news veracity. The extension searches all links on a given web page for references to unreliable sources by checking them against a manually compiled list of domains. The labels are the outputs of BS Detector rather than of human annotators.

CREDBANK¹⁰: This is a large-scale crowd-sourced dataset [13] of around 60 million tweets that cover 96 days starting from October 2015. The tweets are related to over 1,000 news events. Each event is assessed for credibility by 30 annotators from Amazon Mechanical Turk.

BuzzFace¹¹: This dataset [17] is collected by extending the BuzzFeed dataset with comments related to news articles on Facebook. The dataset contains 2,263 news articles and 1.6 million comments.

FacebookHoax¹²: This dataset [23] comprises information related to posts from Facebook pages related to scientific news (non-hoax) and conspiracy pages (hoax), collected using the Facebook Graph API. The dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific) with more than 2,300,000 likes.

We provide a comparison in Table 1 to show that no existing public dataset provides all features of news content, social context, and spatiotemporal information. Existing datasets have limitations that FakeNewsNet addresses. For example, BuzzFeedNews only contains headlines and text for each news piece and covers news articles from very few news agencies. The LIAR dataset contains mostly short statements rather than entire news articles with meta attributes. BS Detector data is collected and annotated using a news veracity checking tool rather than human expert annotators. The CREDBANK dataset was originally collected for evaluating tweet credibility; its tweets are not related to fake news articles and hence cannot be effectively used for fake news detection. The BuzzFace dataset has basic news content and social context information, but it does not capture temporal information. The FacebookHoax dataset consists of very few instances about conspiracy theories and scientific news.

To address the disadvantages of existing fake news detection datasets, the proposed FakeNewsNet repository collects multi-dimensional information covering news content, social context, and spatiotemporal information from different types of news domains, such as political and entertainment sources.

Table 1: Comparison with existing fake news detection datasets

Feature columns: News Content (Linguistic, Visual); Social Context (User, Post, Response, Network); Spatiotemporal Information (Spatial, Temporal)

Datasets compared: BuzzFeedNews, LIAR, BS Detector, CREDBANK, BuzzFace, FacebookHoax, FakeNewsNet

8https://www.cs.ucsb.edu/~william/software.html

9https://github.com/bs-detector/bs-detector

10http://compsocial.github.io/CREDBANK-data/

11https://github.com/gsantia/BuzzFace

12https://github.com/gabll/some-like-it-hoax


Figure 1: The flowchart of dataset integration process for FakeNewsNet. It mainly describes the collection

of news content, social context and spatiotemporal information.

3 Dataset Integration

In this section, we introduce a process that integrates datasets to construct the FakeNewsNet repository. We demonstrate (see Figure 1) how we collect news content with reliable ground truth labels, and how we obtain additional social context and spatiotemporal information.

3.1 News Content

To collect reliable ground truth labels for fake news, we utilize fact-checking websites such as PolitiFact¹³ and GossipCop¹⁴ to obtain news content for fake news and true news. In PolitiFact, journalists and domain experts review political news and provide fact-checking evaluation results that label news articles as fake¹⁵ or real¹⁶. We use these labels as ground truth for fake and real news pieces. PolitiFact's fact-checking evaluation results provide the source URLs of the web pages that published the news articles, which can be used to fetch the corresponding news content. In some cases, the web pages of the source news articles have been removed and are no longer available. To tackle this problem, we i) check whether the removed page was archived and automatically retrieve its content from the Wayback Machine¹⁷; and ii) use Google web search in an automated fashion to identify the news article that is most related to the actual news. GossipCop is a website for fact-checking entertainment stories aggregated from various media outlets. GossipCop provides rating scores on a scale of 0 to 10 that grade a news story from fake to real. From our observation, almost 90% of the stories from GossipCop have scores less than 5, mainly because the purpose of GossipCop is to showcase more fake stories. In order to collect true entertainment news pieces, we crawl news articles from E! Online¹⁸, a well-known trusted media website for entertainment news. We consider all the articles from E! Online as real news. We collect all the news stories from GossipCop with rating scores less than 5 as fake news stories.
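The archive lookup for removed source pages can be illustrated with a short sketch. It assumes the Internet Archive's public availability endpoint is used; the paper does not state which interface the crawler relies on, and the function names below are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url: str) -> str:
    """Build the availability query for a (possibly removed) source URL."""
    return f"{WAYBACK_API}?{urlencode({'url': page_url})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived snapshot URL from an availability response, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the endpoint over the network and return a snapshot URL or None."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

When no snapshot is returned, a pipeline of this kind would fall back to the web search step described above.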

Since GossipCop does not explicitly provide the URL of the source news article, we similarly search for the news headline in Google or the Wayback Machine archive to obtain the news source information. The headlines of GossipCop

13https://www.politifact.com/

14https://www.gossipcop.com/

15https://www.politifact.com/subjects/fake-news/

16https://www.politifact.com/truth-o-meter/rulings/true/

17https://archive.org/web/

18https://www.eonline.com/


news articles are generally written to reflect the actual fact and so may not be usable directly. For example, one of the headlines, “Jennifer Aniston NOT Wearing Brad Pitts Engagement Ring, Despite Report”, states the fact instead of the original news article's title. We use heuristics to extract proper headlines, such as i) using the text in quoted strings; and ii) removing negative sentiment words. For example, some headlines include quoted strings that are exact text from the original news source. In this case, we extract the named entities from the headline using the CoreNLP tool [12] and combine them with the quoted strings to form the search query. For example, from the headline Jennifer Aniston, Brad Pitt NOT “Just Married” Despite Report, we extract the named entities Jennifer Aniston and Brad Pitt and the quoted string Just Married, and form the search query “Jennifer Aniston Brad Pitt Just Married”, because the quoted text together with the named entities mostly provides the context of the original news. As another example, some headlines are written in the negative sense to correct the false information, e.g., “Jennifer Aniston NOT Wearing Brad Pitts Engagement Ring, Despite Report”. In this case, we remove negative sentiment words retrieved from SentiWordNet [3], along with some hand-picked words, from the headline to form the search query, e.g., “Jennifer Aniston Wearing Brad Pitts Engagement Ring”.
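The two headline heuristics can be sketched as follows. This is a minimal illustration under stated assumptions: the word list `NEGATIVE_WORDS` is a hypothetical stand-in for the SentiWordNet and hand-picked words, which the paper does not enumerate, and the named entities are passed in by the caller rather than extracted with CoreNLP.

```python
import re

# Hypothetical stand-in for the SentiWordNet and hand-picked word list.
NEGATIVE_WORDS = {"not", "despite", "report", "false", "untrue"}

# Matches text between straight or curly double quotes.
QUOTED = re.compile(r'[“"]([^”"]+)[”"]')


def quoted_strings(headline):
    """Return the quoted spans of a headline, in order of appearance."""
    return [span.strip() for span in QUOTED.findall(headline)]


def strip_negative_words(headline):
    """Drop punctuation and any token found in NEGATIVE_WORDS."""
    tokens = re.findall(r"[A-Za-z0-9']+", headline)
    return " ".join(t for t in tokens if t.lower() not in NEGATIVE_WORDS)


def build_query(headline, named_entities=()):
    """Heuristic i) when the headline has quoted text, otherwise heuristic ii)."""
    quotes = quoted_strings(headline)
    if quotes:
        return " ".join([*named_entities, *quotes])
    return strip_negative_words(headline)
```

For the two example headlines in the text, `build_query` yields "Jennifer Aniston Brad Pitt Just Married" (given the two named entities) and "Jennifer Aniston Wearing Brad Pitts Engagement Ring".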

Table 2: Statistics of the FakeNewsNet repository

Type                        Features    Statistic                       PolitiFact Fake PolitiFact Real GossipCop Fake  GossipCop Real
News Content                Linguistic  # News articles                 432             624             5,323           16,817
                                        # News articles with text       420             528             4,947           16,694
                            Visual      # News articles with images     336             447             1,650           16,767
Social Context              User        # Users posting tweets          95,553          249,887         265,155         80,137
                                        # Users involved in likes       113,473         401,363         348,852         145,078
                                        # Users involved in retweets    106,195         346,459         239,483         118,894
                                        # Users involved in replies     40,585          186,675         106,325         50,799
                            Post        # Tweets posting news           164,892         399,237         519,581         876,967
                            Response    # Tweets with replies           11,975          41,852          39,717          11,912
                                        # Tweets with likes             31,692          93,839          96,906          41,889
                                        # Tweets with retweets          23,489          67,035          56,552          24,955
                            Network     # Followers                     405,509,460     1,012,218,640   630,231,413     293,001,487
                                        # Followees                     449,463,557     1,071,492,603   619,207,586     308,428,225
                                        Average # followers             1299.98         982.67          1020.99         933.64
                                        Average # followees             1440.89         1040.21         1003.14         982.80
Spatiotemporal Information  Spatial     # User profiles with locations  217,379         719,331         429,547         220,264
                                        # Tweets with locations         3,337           12,692          12,286          2,451
                            Temporal    # Timestamps for news           296             167             3,558           9,119
                                        # Timestamps for response       171,301         669,641         381,600         200,531

3.2 Social Context

The user engagements related to the fake and real news pieces from fact-checking websites are collected using the search APIs provided by social media platforms, such as Twitter's Advanced Search API¹⁹. The search queries for collecting user engagements are formed from the headlines of news articles, with special characters removed to filter out noise. We search for tweets using queries containing all the words in the headline to ensure the relevance of the resulting tweets. In addition, the URLs mentioned in the collected tweets are used as further search queries to collect additional tweets, which reduces the bias of collecting data using keywords only. After we obtain the social media posts that directly spread news pieces, we fetch the user responses to these posts, such as replies, likes, and reposts. In addition, once we have obtained all the users engaging in the news dissemination process, we collect the metadata for user profiles, user posts, and the social network information.
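The query construction described above can be sketched as a small text-cleaning step. The function names are hypothetical, and the exact set of characters the collection pipeline removes is not given in the paper; here every non-alphanumeric, non-space character is treated as special.

```python
import re


def headline_to_query(headline: str) -> str:
    """Replace special characters with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", headline)
    return " ".join(cleaned.split())


def matches_all_words(query: str, tweet_text: str) -> bool:
    """True when every query word occurs in the tweet (case-insensitive)."""
    tweet_words = set(headline_to_query(tweet_text).lower().split())
    return all(word in tweet_words for word in query.lower().split())
```

Requiring all headline words favors precision over recall, which is consistent with the stated goal of ensuring relevance; the URL-based follow-up search then recovers tweets that the keyword query misses.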

3.3 Spatiotemporal Information

The spatiotemporal information includes spatial and temporal information. For spatial information, we obtain the locations explicitly provided in user profiles. For temporal information, we record the timestamps of user engagements, which can be used to study how fake news pieces propagate on social media and how the topics of fake news change over time. Since fact-checking websites periodically

19https://twitter.com/search-advanced?lang=en


Figure 2: The word cloud of news body text for fake and real news on PolitiFact and GossipCop: (a) PolitiFact fake news, (b) PolitiFact real news, (c) GossipCop fake news, (d) GossipCop real news.

update their newly published news articles, we dynamically collect these newly added news pieces and update the FakeNewsNet repository as well. In addition, we periodically keep collecting the user engagements for all the news pieces in the FakeNewsNet repository, such as recent social media posts and second-order user behaviors such as replies, likes, and retweets. For example, we run the news content crawler and update the tweet collector daily. The spatiotemporal information provides useful and comprehensive information for studying the fake news problem from a temporal perspective.

4 Data Analysis

FakeNewsNet has multi-dimensional information related to news content, social context, and spatiotemporal information. In this section, we first provide some preliminary quantitative analysis to illustrate the features of FakeNewsNet. We then perform fake news detection using several state-of-the-art models to evaluate the quality of the FakeNewsNet repository. The detailed statistics of the FakeNewsNet repository are shown in Table 2.

4.1 Assessing News Content

Since fake news attempts to spread false claims in news content, the most straightforward means of detecting it is to find clues in a news article. First, we analyze the topic distribution of fake and real news articles. From Figures 2(a) and 2(b), we can observe that the fake and real news in the PolitiFact dataset is mostly related to political campaigns. For the GossipCop dataset, Figures 2(c) and 2(d) show that the fake and real news is mostly related to gossip about relationships among celebrities. In addition, we can see that the topics of fake news and real news are slightly different in general. However, for a specific news piece, it is difficult to detect fake news using only the topics in its content [18], which makes it necessary to utilize auxiliary information such as social context.

We also explore the distribution of publishers who publish fake news in both datasets. For PolitiFact, we find that there are in total 301 publishers publishing 432 fake news pieces, among which 191 publishers publish only 1 piece of fake news, and 40 publishers publish at least 2 pieces of fake news, such as theglobalheadlines.net and worldnewsdailyreport.com. For GossipCop, there are in total 209 publishers publishing 6,048 fake news pieces, among which 114 publishers publish only 1 piece of fake news, and 95 publishers publish at least 2 pieces of fake news, such as hollywoodlife.com and celebrityinsider.org. The reason may be that these fact-checking websites try to identify check-worthy breaking news events regardless of the publishers, and that fake news publishers can be shut down after they are reported to have published fake news pieces.


4.2 Comparing Social Contexts of Fake and Real News

Social context represents the news proliferation process over time, which provides useful auxiliary information

to infer the veracity of news articles. Generally, there are three major aspects of the social media context

that we want to represent: user profiles, user posts, and network structures. Next, we perform an exploratory

study of these aspects on FakeNewsNet and introduce the potential usage of these features to help fake news

detection.

4.2.1 User Profiles

Figure 3: The distribution of user profile creation dates on PolitiFact and GossipCop datasets: (a) PolitiFact dataset, (b) GossipCop dataset.

User profiles on social media have been shown to be informative for fake news detection [22]. Research has also shown that fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs [18, 20]. We illustrate some user profile features in the FakeNewsNet repository.

First, we explore whether the creation times of user accounts differ between fake news and true news. We compute the time elapsed between each account's registration time and the current date, and the results are shown in Figure 3. We can see that the account age distribution of users posting fake news is significantly different from that of users posting real news, with a p-value < 0.05 under a t-test. We also notice that neither older nor newer accounts consistently post fake or real news more often. For example, users posting fake news have, on average, older accounts (mean account age 2214.09) than users posting real news (2166.84) in PolitiFact, while we see the opposite in the GossipCop dataset.
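The account age comparison can be illustrated with a small sketch. It makes two assumptions that the paper does not confirm: account age is measured in days, and the test statistic is Welch's unequal-variance t statistic (the text says only "t-test"). The function names are hypothetical, and the p-value lookup is omitted.

```python
from datetime import date
from math import sqrt
from statistics import mean, variance


def account_age_days(created: date, reference: date) -> int:
    """Age of an account in days, relative to a reference (collection) date."""
    return (reference - created).days


def welch_t(sample_a, sample_b) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se
```

A positive statistic for (fake, real) account ages would indicate that accounts posting fake news are older on average, and a negative one the reverse.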

Figure 4: A comparison of bot scores on users related to fake and real news on PolitiFact dataset.

Next, we take a deeper look into the user profiles and assess the effect of social bots. We randomly selected 10,000 users who posted fake and real news and performed bot detection using Botometer [5], one of the


Figure 5: Ternary plots of the ratio of the positive, neutral and negative sentiment replies for fake and real news: (a) PolitiFact dataset, (b) GossipCop dataset.

state-of-the-art bot detection algorithms. Botometer²⁰ takes a Twitter username as input, utilizes various features extracted from metadata, and outputs a probability score in [0, 1] indicating how likely the user is to be a bot. We set a threshold of 0.5 on the bot score returned by Botometer to determine bot accounts. Figure 4 shows the ratio of bot and human users involved in tweets related to fake and real news. We can see that bots are more likely to post tweets related to fake news than tweets related to real news. For example, almost 22% of users involved in fake news are bots, while only around 9% of users involved in real news are predicted to be bots. Similar results were observed with different thresholds on bot scores on both datasets. This indicates that there are bots on Twitter spreading fake news, which is consistent with the observation in [20]. In addition, most users that spread fake news (around 78%) are still more likely to be humans than bots (around 22%), which is also consistent with the findings in [24].
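The thresholding step can be sketched as below. The scores are assumed to have been obtained already (no Botometer call is made here), the function name is hypothetical, and treating a score exactly equal to the threshold as a bot is an assumption, since the text does not say whether the comparison is strict.

```python
def bot_fraction(scores, threshold=0.5):
    """Fraction of users whose bot score is at or above the threshold."""
    if not scores:
        raise ValueError("at least one bot score is required")
    return sum(score >= threshold for score in scores) / len(scores)


def threshold_sweep(scores, thresholds=(0.3, 0.5, 0.7)):
    """Bot fraction at several thresholds, to check robustness of a comparison."""
    return {t: bot_fraction(scores, t) for t in thresholds}
```

Running `threshold_sweep` separately on the scores of users involved in fake and in real news mirrors the check that the gap persists across thresholds.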

4.2.2 Post and Response

People express their emotions or opinions towards fake news through social media posts, such as skeptical

opinions, sensational reactions, etc. These features are important signals to study fake news and disinfor-

mation in general [9, 14].

We perform sentiment analysis on the replies to user posts that spread fake news and real news using VADER²¹ [8], one of the state-of-the-art unsupervised sentiment prediction tools. Figure 5 shows the relationship between positive, neutral, and negative replies for all news articles. For each news piece, we obtain all of its replies and predict the sentiment of each as positive, negative, or neutral. Then we calculate the ratio of positive, negative, and neutral replies for the news piece. For example, if a news piece has a sentiment distribution of replies of [1/3, 1/3, 1/3], it lies at the very center of the triangle in Figure 5(a). We can also see that real news has more neutral replies than positive and negative replies, whereas fake articles have a larger ratio of negative replies. For the replies in the GossipCop dataset, shown in Figure 5(b), we cannot observe any significant differences between fake and real news. This could be because it is difficult for ordinary people to distinguish fake from real entertainment news.
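The per-article sentiment ratios plotted in a ternary diagram can be sketched as follows. The input is assumed to be a list of VADER compound scores, one per reply, computed beforehand. The ±0.05 cutoff is the convention suggested in the VADER documentation and is an assumption here, since the paper does not state the cutoffs it used; the function names are hypothetical.

```python
from collections import Counter


def label(compound: float, cutoff: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment class."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def sentiment_ratios(compound_scores):
    """Return (positive, neutral, negative) shares of replies; they sum to 1."""
    if not compound_scores:
        raise ValueError("a news piece needs at least one reply")
    counts = Counter(label(score) for score in compound_scores)
    total = len(compound_scores)
    return tuple(counts[name] / total for name in ("positive", "neutral", "negative"))
```

Because the three shares always sum to one, each news piece maps to a single point inside the triangle.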

We analyze the distribution of likes, retweets, and replies of tweets, which can help gain insights into user interaction related to fake and real news. Social science studies have theorized about the relationship between user behaviors and users' perceived beliefs about information on social media [10]. For example, likes and retweets are more emotional behaviors, while replies are more rational.

We plot ternary triangles that illustrate the ratio of replies, retweets, and likes among the second-order engagements with the posts that spread fake or real news pieces. From Figure 6, we observe that: i) fake news pieces tend to have fewer replies and more retweets; ii) real news pieces have a higher ratio of likes than fake news pieces, which may indicate that users are more likely to agree with real news. The differences in the distribution of user behaviors between fake news and real news have the potential to help study

20https://botometer.iuni.iu.edu/

21https://github.com/cjhutto/vaderSentiment


Figure 6: Ternary plots of the ratio of likes, retweets, and replies of tweets related to fake and real news: (a) PolitiFact dataset, (b) GossipCop dataset.

Figure 7: The distribution of the count of followers and followees related to fake and real news: (a) follower count of users in the PolitiFact dataset, (b) followee count of users in the PolitiFact dataset, (c) follower count of users in the GossipCop dataset, (d) followee count of users in the GossipCop dataset.

users' belief characteristics. FakeNewsNet provides real-world datasets for understanding the social factors of user engagements and the underlying social science as well.

4.2.3 Networks

Users tend to form different networks on social media in terms of interests, topics, and relations, which serve as the fundamental paths for information diffusion [18]. Fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection [6].

We look at the social network statistics of all the users that spread fake news or real news. Social network features such as follower count and followee count can be used to estimate how far fake news can spread on social media. We plot the distribution of the follower count and followee count of users in Figure 7. We can see that: i) the follower and followee counts of the users generally follow a power-law


distribution, which is commonly observed in social network structures; ii) there is a spike in the followee count distribution for both groups of users, which is caused by the restriction imposed by Twitter²² that users can have at most 5,000 followees when their number of followers is less than 5,000.

4.3 Characterizing Spatiotemporal Information

Figure 8: The comparison of temporal user engagements of fake and real news: (a) temporal user engagements of fake news, (b) temporal user engagements of real news.

Recent research has shown that users' temporal responses can be modeled using deep neural networks to help detect fake news [16], and that deep generative models can generate synthetic user engagements to help early fake news detection [11]. The spatiotemporal information in FakeNewsNet captures the temporal user engagements for news articles, which provides the information needed to further study the utility of spatiotemporal information for detecting fake news.

First, we investigate whether the temporal user engagements, such as posts, replies, and retweets, are different for fake news and real news with similar topics, e.g., the fake news “TRUMP APPROVAL RATING Better than Obama and Reagan at Same Point in their Presidencies” from June 9, 2018 to June 13, 2018 and the real news “President Trump in Moon Township Pennsylvania” from March 10, 2018 to March 20, 2018. As shown in Figure 8, we can observe that: i) for fake news, there is a sudden increase in the number of retweets, which then remains constant after a short time, whereas for real news there is a steady increase in the number of retweets; ii) fake news pieces tend to receive fewer replies than real news. We have similar observations in Table 2, where replies account for 5.76% of all tweets for fake news and 7.93% for real news. The differences in diffusion patterns of temporal user engagements have the potential to determine the threshold time for early fake news detection. For example, if we can predict the sudden increase in user engagements, we should use the user engagements before that time point to detect fake news accurately and limit the impact of fake news spreading [21].

Next, we examine the geo-location distribution of users engaging with fake and real news (see Figure 9 for the PolitiFact dataset). We show the locations explicitly provided by users in their profiles, and we can see that users in the PolitiFact dataset who post fake news have a different distribution than those posting real news. Since the locations explicitly provided by users are usually sparse, we can further consider the location information attached to tweets, and even utilize existing approaches for inferring locations [28]. It would be interesting to explore how users are geographically distributed using the FakeNewsNet repository from different perspectives.

4.4 Fake News Detection Performance

In this subsection, we utilize the PolitiFact and GossipCop datasets from FakeNewsNet repository to perform

fake news detection. We use 80% of data for training and 20% for testing. For evaluation, we use accuracy

and F1 score.
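The evaluation protocol can be sketched in plain Python as below. This is an illustration only: the paper does not say how the 80/20 split was drawn or which class is treated as positive for F1, so the seeded random shuffle and the choice of fake (label 1) as the positive class are assumptions, and the function names are hypothetical.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out the last test_fraction of items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Share of predictions that match the ground truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Reporting F1 alongside accuracy matters here because the classes are imbalanced, particularly in GossipCop, where real articles outnumber fake ones by roughly three to one in Table 2.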

News content: To evaluate the news content, the text of the source news articles is represented as a one-hot encoded vector, and we then apply standard machine learning models including support vector machines (SVM), logistic regression (LR), Naive Bayes (NB), and CNN. For SVM, LR,

22https://help.twitter.com/en/using-twitter/twitter-follow-limit


Figure 9: Spatial distribution of users posting tweets related to fake and real news in the PolitiFact dataset. (a) Spatial distribution for fake news; (b) Spatial distribution for real news.

Table 3: Fake news detection performance on FakeNewsNet

Model     PolitiFact         GossipCop
          Acc.     F1        Acc.     F1
SVM       0.580    0.659     0.497    0.595
LR        0.642    0.633     0.648    0.646
NB        0.617    0.651     0.624    0.649
CNN       0.629    0.583     0.723    0.725
SAF /S    0.654    0.681     0.689    0.703
SAF /A    0.667    0.619     0.635    0.706
SAF       0.691    0.706     0.689    0.717

and NB, we use the default settings provided in scikit-learn and do not tune parameters. For CNN, we use a standard implementation with default settings23. We also evaluate the classification of news articles using the Social Article Fusion (SAF /S) model [27], which uses an autoencoder to learn features from news articles in order to classify new articles as fake or real.

Social context: To evaluate the social context, we use a variant of the SAF model [27], i.e., SAF /A, which utilizes the temporal pattern of the user engagements to detect fake news.

News content and social context: We use the Social Article Fusion (SAF) model, which combines SAF /S and SAF /A. This model uses an autoencoder with two layers of LSTM cells for both the encoder and the decoder, and captures the temporal pattern of the user engagements using another network of LSTM cells with two layers.
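To make the news content representation concrete, the sketch below shows one plausible reading of "one-hot encoded vector": a binary bag-of-words vector with one position per vocabulary word. The whitespace tokenization, lowercasing, and function names are assumptions; the exact preprocessing used in the experiments is not described.

```python
def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot_encode(document, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    vector = [0] * len(vocab)
    for token in document.lower().split():
        index = vocab.get(token)
        if index is not None:  # words unseen when building the vocabulary are ignored
            vector[index] = 1
    return vector


# Toy training texts, purely for illustration.
train_docs = ["breaking news today", "news is fake"]
vocab = build_vocabulary(train_docs)
encoded = one_hot_encode("fake news fake unknown", vocab)
```

Vectors of this kind could then be passed to scikit-learn classifiers constructed with their default arguments, which is in the spirit of the untuned SVM, LR, and NB baselines described above.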

The experimental results are shown in Table 3. We can see that: i) among the news content-based methods, SAF /S performs best in terms of accuracy and F1 score on PolitiFact, while CNN performs best on GossipCop. SAF /A achieves around 66.7% accuracy on PolitiFact, similar to SAF /S. The compared baseline models provide reasonably good performance for fake news detection, with accuracy mostly between 60% and 65% on PolitiFact; ii) SAF achieves accuracy at least as high as both SAF /S and SAF /A on both datasets. For example, SAF has around 5.65% and 3.60% relative improvement over SAF /S and SAF /A on PolitiFact in terms of accuracy. This indicates that user engagements can help fake news detection in addition to news articles on the PolitiFact dataset.
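The improvement figures quoted above are relative gains in accuracy, which can be checked against the rounded values in Table 3. The helper name below is illustrative. With the rounded table values the first gain comes out at roughly 5.66%, marginally different from the quoted 5.65%, presumably because the quoted figure was computed from unrounded accuracies.

```python
# Accuracy values copied from the PolitiFact column of Table 3.
politifact_acc = {"SAF /S": 0.654, "SAF /A": 0.667, "SAF": 0.691}


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gain_over_s = relative_improvement(politifact_acc["SAF"], politifact_acc["SAF /S"])
gain_over_a = relative_improvement(politifact_acc["SAF"], politifact_acc["SAF /A"])
print(f"{gain_over_s:.2f}% over SAF /S, {gain_over_a:.2f}% over SAF /A")
```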

In summary, FakeNewsNet provides multiple dimensions of information that have the potential to help researchers develop novel algorithms for fake news detection.

5

Data Structure

In this section, we describe in detail the structure of FakeNewsNet. We introduce the data format and provide API interfaces that allow for efficient downloading of the dataset under the policies of social media platforms.

23https://github.com/dennybritz/cnn-text-classification-tf


5.1

API Interfaces

The full dataset is massive and the actual content cannot be directly distributed because of Twitter’s sharing policy24. The dataset25 is referenced using a DOI 26 and adheres to the FAIR Data Principles 27. The APIs are provided in the form of multiple well-documented Python scripts, together with CSV files containing the news content URLs and associated tweet IDs. To initiate the download, the user simply needs to run the main.py file with the required configuration. The APIs make use of Twitter access tokens to fetch information related to tweets. These APIs can help to download specific subsets of the dataset, such as linguistic content, tweet information, retweet information, user information, and the social network. Since Twitter does not provide APIs to download replies and likes of tweets, web scraping tools can be used.
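As an illustration of how the CSV files with news URLs and tweet IDs might be consumed before any download step, the sketch below parses such a file into a dictionary. The column names `id`, `news_url`, and `tweet_ids`, and the tab-separated layout of the tweet IDs, are assumptions made for this example; the header of the actual CSV files should be checked first.

```python
import csv
import io


def read_news_index(csv_file):
    """Parse rows of (news id, news URL, tweet ids) into a dictionary.

    Assumes hypothetical columns `id`, `news_url`, and `tweet_ids`, where
    `tweet_ids` holds tab-separated tweet ids.
    """
    index = {}
    for row in csv.DictReader(csv_file):
        tweet_ids = [t for t in (row.get("tweet_ids") or "").split("\t") if t]
        index[row["id"]] = {"news_url": row["news_url"], "tweet_ids": tweet_ids}
    return index


# Tiny in-memory stand-in for one of the provided CSV files (made-up values).
sample = io.StringIO(
    "id,news_url,tweet_ids\n"
    "news-1,https://example.com/a,111\t222\n"
    "news-2,https://example.com/b,\n"
)
news_index = read_news_index(sample)
```

Keeping tweet IDs as strings avoids precision problems if the IDs are later serialized to formats that treat large integers as floating-point numbers.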

5.2

Data Format

The news pieces from different platforms/domains are stored in different directories. For example, the gossipcop/fake directory contains fake news samples from the GossipCop dataset. Each news piece has its own directory, named with its auto-generated news ID, with the following structure: news article.json file, tweets folder, retweets folder, replies folder, and likes folder.

news article.json includes all the meta information of the news articles collected using the provided

news source URLs. This is a JSON object with attributes including:

text is the text of the body of the news article.

images is a list of the URLs of all the images in the news article web page.

publish date indicates the publishing date of that article.

tweets folder contains the metadata of the list of tweets associated with the news article. Each file in

this folder contains the tweet objects returned by the Twitter API.

retweets folder includes a list of files containing the retweets of tweets posting the news articles. Each file is named <tweet id>.json and has a list of retweet objects collected using the Twitter API.

replies folder contains files including replies and conversation threads of tweets sharing the news such

as reply text, user details and reply timestamps.

likes folder comprises files containing a list of IDs for users who have liked each of the tweets sharing

the news article.
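A minimal sketch of reading one news directory laid out as described above is given below. The file and folder names are taken from the text as written and are passed as parameters because they may be spelled differently (for example with underscores) in a downloaded copy; the `publish date` key is likewise an assumption based on the attribute list.

```python
import json
from pathlib import Path


def load_news_item(news_dir, article_name="news article.json", tweets_dir="tweets"):
    """Load one news piece: article metadata plus the content of its tweet files."""
    news_dir = Path(news_dir)

    article = {}
    article_path = news_dir / article_name
    if article_path.exists():
        article = json.loads(article_path.read_text(encoding="utf-8"))

    tweets = []
    tweets_path = news_dir / tweets_dir
    if tweets_path.is_dir():
        # Each file holds tweet object(s) as returned by the Twitter API.
        for tweet_file in sorted(tweets_path.glob("*.json")):
            tweets.append(json.loads(tweet_file.read_text(encoding="utf-8")))

    return {
        "news_id": news_dir.name,  # the directory name is the news ID
        "text": article.get("text", ""),
        "images": article.get("images", []),
        "publish_date": article.get("publish date"),  # key name assumed
        "tweets": tweets,
    }
```

Missing files are tolerated on purpose, since partial downloads (for example, news content without tweets) are possible when only specific subsets are fetched.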

In addition, we store the metadata of all users, including profiles, historical tweets, followers, and followees, in the following folders. Each of these folders contains files named <user id>.json holding the details of a particular user. Note that we only show the metadata of 5000 users in the provided link due to space limitations.

user profiles folder includes files containing all the metadata of the users in the dataset. Each file in this directory is a JSON object collected from the Twitter API containing information about the user, including profile creation time, geolocation of the user, profile image URL, followers count, followees count, number of tweets posted, and number of tweets favorited.

user timeline tweets folder includes JSON files containing the list of at most 200 recent tweets posted

by the user. This includes the complete tweet object with all information related to tweet.

user followers folder includes JSON files containing a list of user IDs of users following a particular

user.

user following folder includes JSON files containing a list of user IDs a particular user follows.
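The follower files described above can be turned into a directed edge list for network analysis. The sketch below is one hedged way to do this: it assumes each <user id>.json file holds either a plain JSON list of follower IDs or an object wrapping that list under a hypothetical `followers` key, since the exact layout is not specified here.

```python
import json
from pathlib import Path


def follower_edges(followers_dir):
    """Yield (follower_id, followed_id) pairs from <user id>.json files."""
    for path in sorted(Path(followers_dir).glob("*.json")):
        followed_id = path.stem  # the file name is the user id
        data = json.loads(path.read_text(encoding="utf-8"))
        # Accept a bare list or a dict with an assumed `followers` key.
        ids = data.get("followers", []) if isinstance(data, dict) else data
        for follower_id in ids:
            yield str(follower_id), followed_id
```

The resulting pairs can be loaded into any graph library to study the social network of users involved in spreading a news piece.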

24https://developer.twitter.com/en/developer-terms/agreement-and-policy

25To access the dataset, we have published code implementation available at https://github.com/KaiDMML/FakeNewsNet

that allows the users to download specific subsets of data.

26https://doi.org/10.7910/DVN/UEMMHS

27https://www.force11.org/group/fairgroup/fairprinciples


6

Potential Applications

FakeNewsNet contains multi-dimensional information that could be useful for many applications. We believe FakeNewsNet would benefit the research community in studying various topics such as (early) fake news detection, fake news evolution, fake news mitigation, and malicious account detection.

6.1

Fake News Detection

One of the challenges for fake news detection is the lack of a labeled benchmark dataset with reliable ground truth labels and a comprehensive information space, based on which we can capture effective features and build models. FakeNewsNet can help the fake news detection task because it has reliable labels annotated by journalists and domain experts, and multi-dimensional information spanning news content, social context, and spatiotemporal information.

First, news contents are the fundamental sources to find clues to differentiate fake news pieces. For example, studies have shown that news contents can be modeled with tensor embedding in a semi-supervised or unsupervised manner to detect fake news [30, 32]. In addition, news representations can be obtained with deep neural networks to improve fake news detection [33, 35]. In FakeNewsNet, we provide various attributes of news articles such as publishers, headlines, body texts, and images/videos. This information can be used to extract different linguistic and visual features to further build detection models for clickbait or fake news. Since we directly collect news articles from fact-checking websites such as PolitiFact and GossipCop, we provide detailed explanations from the fact-checkers, which are useful for learning the common and specific aspects in which fake news pieces are formed.

Second, user engagements represent the news proliferation process over time, which provides useful auxiliary information to infer the veracity of news articles [34]. Generally, there are three major aspects of the social context: users, generated posts, and networks. Since fake news pieces are likely to be created and spread by non-human accounts, such as bots [20], capturing users’ profiles and characteristics can provide useful information for fake news detection. Also, people express their emotions or opinions towards fake news through social media posts, and thus we collect all the user posts for news pieces, as well as engagements such as reposts, comments, and likes, which can be used to extract abundant features that capture fake news patterns. Moreover, fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection. We provide a large-scale social network of all the users involved in the news dissemination process.

Third, early fake news detection aims to give early alerts of fake news during the dissemination process before it reaches a broad audience [11]. Early fake news detection methods are therefore highly desirable and socially beneficial. For example, capturing the pattern of user engagements in the early phases could be helpful to achieve the goal of unsupervised detection. Recent approaches utilize advanced deep generative models to generate synthetic user comments to help improve fake news detection performance [15]. FakeNewsNet contains all these types of information, which provides the potential to further explore early fake news detection models. In addition, FakeNewsNet contains two datasets from different domains, i.e., political and entertainment, which can help to study common and different patterns for fake news under different topics. Moreover, being able to explain prediction results is important for decision makers to mitigate fake news, and FakeNewsNet has signals from multiple sources that can be exploited as explainable factors (e.g., user comments) [25].

6.2

Fake News Evolution

The fake news diffusion process also has different stages in terms of people’s attention and reactions as time goes by, resulting in a unique life cycle. For example, breaking news and in-depth news demonstrate different life cycles in social media [4], and social media reactions can help predict future visitation patterns of news pieces accurately even at an early stage. By studying the fake news evolution process, we can gain a deeper understanding of how particular stories “go viral” from normal public discourse. First, tracking the life cycle of fake news on social media requires recording essential trajectories of fake news diffusion in general [19]. Thus, FakeNewsNet has collected the related temporal user engagements, which can keep track of these trajectories. Second, for a specific news event, the related topics may keep changing over time and be diverse for fake news and real news. FakeNewsNet is dynamically collecting associated user engagements and allows us to perform comparative analysis (e.g., see Figure 8), and to further investigate distinct temporal patterns to detect fake news [16]. Moreover, statistical time series models such as temporal point processes can be used to characterize different stages of user activities of news engagements [7]. FakeNewsNet enables temporal modeling from real-world datasets, which is otherwise impossible from synthetic datasets.

6.3

Fake News Mitigation

Fake news mitigation aims to reduce the negative effects brought by fake news. During the spreading process of fake news, users play different roles such as provenances: the sources or originators who publish fake news pieces; persuaders: who spread fake news with supporting opinions; and clarifiers: who propose skeptical and opposing viewpoints towards fake news and try to clarify them. Identifying key users on social media is important to mitigate the effect of fake news [29]. For example, provenances can help answer questions such as whether the piece of news has been modified during its propagation. In addition, it is necessary to identify influential persuaders to limit the spread scope of fake news by blocking the information flow from them to their followers [21]. FakeNewsNet provides rich information about users who post, like, or comment on fake and real news pieces (see Figure 6), which enables the identification of different types of users.

To mitigate the effect of fake news, network intervention aims to develop strategies to control the widespread dissemination of fake news before it goes viral. Two major strategies of network intervention are: i) Influence Minimization: minimizing the spread scope of fake news during the dissemination process; ii) Mitigation Campaign: limiting the impact of fake news by maximizing the spread of true news. FakeNewsNet allows researchers to build a diffusion network with spatiotemporal information and can facilitate a deep understanding of how to minimize the influence scope. Furthermore, we may be able to identify the fake news and real news pieces for a specific event from FakeNewsNet and study the effect of mitigation campaigns in real-world datasets.

6.4

Malicious Account Detection

Studies have shown that malicious accounts that can amplify the spread of fake news include social bots, trolls, and cyborg users. Social bots can give a false impression that information is highly popular and endorsed by many people, which enables the echo chamber effect for the propagation of fake news. Through FakeNewsNet, we can study the nature of users who spread fake news and identify the characteristics of bot accounts used in the fake news diffusion process. Using features such as user profile metadata and the historical tweets of users who spread fake news, along with the social network, one could analyze the differences in user characteristics to cluster users as malicious or not. Through a preliminary study in Figure 4, we have shown that bot users are more likely to exist in the fake news spreading process. Although existing works have studied bot detection in general, few studies investigate the influence of social bots on fake news spreading. FakeNewsNet could potentially facilitate the study of the relationship between fake news and social bots, and further the exploration of the mutual benefits of studying fake news detection and bot detection.

7

Conclusion and Future Work

In this paper, we provide a comprehensive repository, FakeNewsNet, which contains news content, social context, and spatiotemporal information. We propose a principled strategy to collect relevant data from different sources. Moreover, we perform a preliminary exploratory study of various features of FakeNewsNet and demonstrate its utility through a fake news detection task over several state-of-the-art baselines. FakeNewsNet has the potential to facilitate many promising research directions such as fake news detection, mitigation, evolution, and malicious account detection.

There are several interesting options for future work. First, the FakeNewsNet repository can be extended to other reliable news sources such as other fact-checking websites or curated data collections. Second, the selection strategy can be used for web search results to reduce noise in the data collection process. Third, the FakeNewsNet repository can be integrated with front-end software to build an end-to-end system for fake news study.

Acknowledgments

This material is in part supported by the NSF awards #1909555, #1614576, #1742702, #1820609, and

#1915801.

References

[1] Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. arXiv preprint arXiv:1804.08559 (2018).

[2] Srijan Kumar, Robert West, and Jure Leskovec. 2016. Disinformation on the web: Impact, characteristics, and detection of Wikipedia hoaxes. In WWW’16.

[3] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, Vol. 10. 2200–2204.

[4] Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer, and Matt Stempeck. Characterizing the life cycle of online news stories using social media reactions. In CHI’14.

[5] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. BotOrNot: A system to evaluate social bots. In WWW’16.

[6] Michela Del Vicario, Gianna Vivaldo, Alessandro Bessi, Fabiana Zollo, Antonio Scala, Guido Caldarelli, and Walter Quattrociocchi. 2016. Echo chambers: Emotional contagion and group polarization on Facebook. Scientific Reports 6 (2016), 37825.

[7] Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le Song, and Hongyuan Zha. 2017. Fake news mitigation via point process based intervention. arXiv preprint arXiv:1703.07823 (2017).

[8] CJ Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In ICWSM’14.

[9] Zhiwei Jin, Juan Cao, Yongdong Zhang, and Jiebo Luo. News Verification by Exploiting Conflicting Social Viewpoints in Microblogs. In AAAI’16.

[10] Antino Kim and Alan R Dennis. 2017. Says Who?: How News Presentation Format Influences Perceived Believability and the Engagement Level of Social Media Users. (2017).

[11] Yang Liu and Yi-fang Brook Wu. Early Detection of Fake News on Social Media Through Propagation Path Classification with Recurrent and Convolutional Networks. In AAAI’18.

[12] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL’14. 55–60.

[13] Tanushree Mitra and Eric Gilbert. CREDBANK: A Large-Scale Social Media Corpus With Associated Credibility Annotations. In ICWSM’15.

[14] Vahed Qazvinian, Emily Rosengren, Dragomir R Radev, and Qiaozhu Mei. [n.d.]. Rumor has it: Identifying misinformation in microblogs. In EMNLP’11.

[15] Feng Qian, ChengYue Gong, Karishma Sharma, and Yan Liu. Neural User Response Generator: Fake News Detection with Collective User Intelligence. In IJCAI’18.

[16] Natali Ruchansky, Sungyong Seo, and Yan Liu. CSI: A hybrid deep model for fake news detection. In CIKM’17.


[17] Giovanni C Santia and Jake Ryland Williams. BuzzFace: A News Veracity Dataset with Facebook User Commentary and Egos. In ICWSM’18.

[18] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19, 1 (2017), 22–36.

[19] Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. Hoaxy: A platform for tracking online misinformation. In WWW’16.

[20] Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Varol, Alessandro Flammini, and Filippo Menczer. 2017. The spread of fake news by social bots. arXiv preprint arXiv:1707.07592 (2017).

[21] Kai Shu, H. Russell Bernard, and Huan Liu. 2018. Studying Fake News via Network Analysis: Detection and Mitigation. CoRR abs/1804.10233 (2018).

[22] Kai Shu, Suhang Wang, and Huan Liu. 2018. Understanding user profiles on social media for fake news detection. In 2018 IEEE MIPR. IEEE, 430–435.

[23] Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca de Alfaro. 2017. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506 (2017).

[24] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151.

[25] Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. dEFEND: Explainable fake news detection. In KDD 2019.

[26] William Yang Wang. 2017. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648 (2017).

[27] Kai Shu, Deepak Mahudeswaran, and Huan Liu. FakeNewsTracker: a tool for fake news collection, detection, and visualization. In CMOT’18.

[28] Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, and Adam Tsakalidis. 2017. Towards real-time, country-level location classification of worldwide tweets. IEEE Transactions on Knowledge and Data Engineering 29, 9 (2017), 2053–2066.

[29] Mustafa Alassad, Muhammad Nihal Hussain, and Nitin Agarwal. 2019. Finding Fake News Key Spreaders in Complex Social Networks by Using Bi-Level Decomposition Optimization Method. In International Conference on Modelling and Simulation of Social-Behavioural Phenomena in Creative Societies.

[30] Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E Papalexakis. 2018. Semi-supervised Content-based Detection of Misinformation via Tensor Embeddings. In ASONAM.

[31] Kai Shu and Huan Liu. Detecting fake news on social media. In Synthesis Lectures on Data Mining and Knowledge Discovery, 2019.

[32] Seyedmehdi Hosseinimotlagh and Evangelos E Papalexakis. 2018. Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles. (2018).

[33] Hamid Karimi, Proteek Roy, Sari Saba-Sadiya, and Jiliang Tang. 2018. Multi Source Multi-Class Fake News Detection. In COLING.

[34] Kai Shu, Suhang Wang, and Huan Liu. Beyond News Contents: The Role of Social Context for Fake News Detection. In WSDM’19.

[35] Hamid Karimi and Jiliang Tang. 2019. Learning Hierarchical Discourse-level Structure for Fake News Detection. arXiv preprint arXiv:1903.07389 (2019).
